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ABSTRACT 

Summary: In light-sheet microscopy, overall image content and 
resolution are improved by acquiring and fusing multiple views of 
the sample from different directions. State-of-the-art multi-view (MV) 
deconvolution employs the point spread functions (PSF) of the 
different views to simultaneously fuse and deconvolve the images 
in 3D, but processing takes a multiple of the acquisition time and 
constitutes the bottleneck in the imaging pipeline. Here we show 
that MV deconvolution in 3D can finally be achieved in real-time by 
reslicing the acquired data and processing cross-sectional planes 
individually on the massively parallel architecture of a graphics 
processing unit (GPU). 

Availability: Source code and binaries are available on github 
I https://github.eom/bene51/i, native code under the repository 
’gpu_deconvolution’, Java wrappers implementing Fiji plugins under 
’SPIM_Reconstruction_Cuda’. 

Contact: bschmid(5)mpi-cbg.de 

Supplementary information: Supplementary data are available at 
the end of this document. 


1 INTRODUCTION 

MV imaging is particularly useful in light-sheet microscopy 
where consecutive views are acquired in short succession, 
allowing reconstruction of entire developing organisms without 
artifacts ( |Huisken et al\ |2004| l. Due to the low photo-toxicity 
in light sheet microscopy, time-lapse experiments are oftentimes 
run over days and TBs of data accumulate quickly. MV fusion 
is therefore particularly desirable to be performed in real-time 
to eliminate redundant information from different views. Best 
fusion results, however, are achieved by combining fusion with 
3D deconvolution ( [Swoger et al\ |2007[ |Verveer et al\ |2007[ |Wu| 
\et (2/.|[20T3] |. Although efficient Bayesian MV deconvolution based 
on the Richardson-Lucy (RL) algorithm has been shown recently 
to outperform existing methods in terms of fusion quality and 
convergence speed, it is still too slow for real-time processing of 
typical data volumes ( [Preibisch et a/.|[20T4] ). 

The RL deconvolution iterations consist only of convolutions 
and pixel-wise arithmetic operations and could therefore be 
signihcantly accelerated using dedicated hardware such as a 
GPU. The large memory requirements of MV deconvolution, 
however, exceed the limited resources of modern GPUs even for 
moderate data sizes (Supplementary Note 1). Previous attempts 
therefore required splitting the data into blocks of appropriate 
size. Each block then either had to be transferred to and from 
the GPU in each RL iteration ( [Preibisch et al\ |2014| |, or blocks 
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needed to share a considerable amount of overlap to avoid 
border artifacts ( |Temerinac-Ott et ^ [201 Ij i. Therefore, GPU- 
based implementations only achieved a three-times performance 
gain ( [Preibisch et a/.|[20T4] ). 


2 RESULTS 

The primary goal of MV fusion is the improvement of the poor 
axial resolution in a single 3D dataset using the superior lateral 
resolution of an additional, overlapping dataset, and not necessarily 
to improve resolution beyond the intrinsic lateral resolution. We 
therefore approximated the full 3D transfer function with a 2D 
PSF, neglecting one lateral component (along the rotation axis), and 
processed each plane orthogonal to the rotation axis independently 
(Fig. la). Memory requirements were thereby reduced by the 
number of lines read out from the camera chip, i.e. typically 
100-1000 fold (Fig. lb). This allowed us to implement the entire 
MV deconvolution on a GPU. Taking advantage of three CUDA 
(Compute Unihed Device Architecture) streams, we interleaved 
GPU computations with data transfers, such that not only expensive 
copying to and from GPU memory, but also reading and writing 
data from and to the hard drive came without additional cost 
(Supplementary Note 2). Compared to 3D MV deconvolution, with 
and without GPU support, we thereby reduced processing times by 
a factor of up to 25 and 75, respectively (Fig. Ic, Supplementary 
Table 1), while producing comparable results. 

We compared the results obtained by our plane-wise deconvolution 
to the methods commonly used in the light-sheet community, such 
as established 3D deconvolution ( Preibisch et a/.[|MT4] ), averaging 
and entropy-based fusion ( [Preibisch et al 2010[ ) (Fig. Id-i). Both 


averaging and entropy-based fusion were blurry and showed cross¬ 
shaped artifacts, originating from the elongated PSFs along the 
optical axes. 3D deconvolution as well as our plane-wise variant 
reduced artifacts and enhanced the contrast, thus truly improving the 
resolution in the fused data set. While our results were comparable 
to the slow full 3D deconvolution (Fig. lh,i; Supplementary Fig. 1), 
processing times and memory requirements were heavily reduced 
so that the entire deconvolution could be performed in real time. 

We provide our software as a C library that can be directly 
linked to camera acquisition software for real-time processing. To 
also beneht from the increased performance in post-processing, 
we additionally created Java wrappers to provide plugins for the 
open source image processing software Fiji ( [Schindelin et a/.[[2012[ l 
(Supplementary material). 
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Fig. 1. Plane-wise multi-view deconvolution concept and performance, (a) Concept of plane-wise deconvolution for two views. Each data set is resliced into 
planes orthogonal to the microscope’s rotation axis. Data sets are deconvolved plane-by-plane, (b) Memory requirements for traditional 3D and our plane-wise 
multi-view deconvolution, for various data sizes and numbers of views, on a logarithmic scale, (c) Execution times for plane-wise multi-view deconvolution, 
implemented on GPU and CPU, and 3D deconvolution, with and without GPU support. Memory requirements for 3D deconvolution timings for the 2048^ 
pixel data set were beyond the capabilities of our workstation, (d-i) Resulting images of a 9 hours post fertilization transgenic Tg(h2aJva:h2aJva-mCherry) 
zebrafish embryo, using different methods (view along the rotational axis, scale bar 100 pm, 10 pm in the inset): (d, e) acquired raw data, (f-i) fusion performed 
by (f) averaging, (g) entropy-weighted averaging, (h) 3D multi-view deconvolution and (i) plane-wise multi-view deconvolution (10 iterations). (Dell T6100, 
Intel E5-2630 @2.3 GHz 2 processors, 64 GB RAM; Graphics card: Nvidia GeForce GTX TITAN Black). 


3 VALIDATION 

Our plane-wise deconvolution approximates 3D deconvolution by 
neglecting the contribution of the PSF along the rotation axis. 
Using artificial data (Supplementary Fig. 2), we found that the 
validity of this approximation is independent of the amount of noise 
(Supplementary Fig. 3), but depends on the lateral extents of the 
PSF. Keeping its axial standard deviation fixed at eight pixels, a 
typical value measured on our microscopes, we found that up to 
a lateral standard deviation of 2-3 pixels, results from plane-wise 
and 3D deconvolution are undistinguishable (Supplementary Fig. 4). 
The measured lateral standard deviation of the PSF was typically 
between 1.5 and 1.8 pixels on our microscopes. 


4 CONCLUSION 

With the advent of first commercially available systems, light- 
sheet microscopy becomes more and more popular. Its photo¬ 
efficiency enables long time-lapse imaging of living samples to 
study fundamental questions in developmental biology. However, 
the huge data rates and enormous amounts of data it produces 
also open new challenges for data processing and handling. A 
key problem in light-sheet microscopy is the fusion of data 
recorded from multiple angles. In this paper, we have presented a 
new method that performs MV deconvolution plane-wise, which 
reduces memory requirements compared to existing methods and 
thus permits an entirely GPU-based implementation. The achieved 
acceleration makes MV deconvolution for the first time applicable 
in real-time without the need for data cropping or resampling. 
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Supplementary Material 


SUPPLEMENTARY NOTE 1: MEMORY REQUIREMENTS FOR MULTI-VIEW DECONVOLUTION 

The multi-view Richardson-Lucy deconvolution algorithm iteratively updates the current estimate of the true image by the following formula: 




vev 


* Pv 


* p.t 


Assuming the two convolutions are performed in Fourier space, the memory required per pixel (of the isotropic fused dataset) is 22 * * u + 12 
bytes, where v is the number of views: 


description 

data type 

memory requirement 
(bytes/pixel) 

kernel spectrum FFT(P^;) 

complex 

8u 

inverted kernel spectrum FFT(P*) 

complex 

8u 

data (^v 

uintlb 

2 V 

weights 

fioat 

4 V 

estimate ^ 

fioat 

4 

estimate spectrum FFT(^) 

fioat 

4 

temporary buffer 

fioat 

4 

22 u + 12 


SUPPLEMENTARY NOTE 2: IMPLEMENTATION 

1. Summary of optimization methods 

[Preibisch et al\{20\A\ derived a number of optimizations to make traditional Richardson-Lucy multi-view deconvolution more efficient. 

While the time required for one iteration remained unchanged, less iterations were required for achieving the same deblurring. The 
implemented variants all used the following formula, but replaced X with the expressions given below: 


• Independent: 


n 


vev 


' 0 ^ * Pv 


* X 


x = p: 


• Efficient Bayesian: 

x = p: jq p:*p^*p: 

wEWv 


• Optimization I: 

x = p: ]q p:*p^ 

veWv 


• Optimization II: 


x= n p; 

V^WEWy 


where is the estimate at iteration t, (j)v is the observed data of view u, Py is the PSF of view u, and P* is the dipped PSF of view v. 
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2. Convergence and number of iterations 

The optimizations derived in |Preibisch et al\ ( |2014| ) and listed above reduce the number of iterations the algorithm requires to converge. 
Convergence behavior of the different optimization variants were extensively studied in [Preibisch et aL\ p014| ) and apply likewise to our 
implementation. In practice, choosing the number of iterations is a trade-off between achieved quality and computation time. We therefore 
leave it to the user, who needs to make this decision based on the particular situation (e.g. if deconvolution is performed in real-time, a 
reduced number of iterations might be preferred for an increase in overall acquisition speed). To facilitate the decision, we provide a tool for 
interactively investigating different numbers of iterations on a single cross-section (see also the Fiji plugin manual). 


3. CUBA workflow for plane-wise multi-view deconvolution 

Our plane-wise multi-view deconvolution implementation uses multiple CUDA streams to overlap GPU computations with data transfer, 
such that not only copies to and from the GPU, but also loading and saving data from and to hard-drive come without additional cost. The 
implemented workflow is outlined below. Here, all processing and CUDA calls are asynchronous, i.e. non-blocking. Synchronization is 
achieved by calls to cudaStreamSynchronize(). 


Algorithm 1: Workflow to interleave disk I/O and memory transfers with data processing. 


nStreams = 3; 

for z ^ 0 to nStreams do 
Initialize streams [z]; 

Load plane from hard-drive into main memory; 

end 

for z ^ 0 to nStreams do 

stream = streams[z mod nStreams]; 
if z >= nStreams then 

eudaStr earn Synehronize{str earn ); 

Save deconvolved plane {z — nStreams) to hard-drive; 
Load plane z from hard-drive into main-memory; 

end 

Copy plane from main memory to GPU; 
for it ^ 0 to niterations do 
I Calculate Richardson-Lucy step on stream; 

end 

Copy plane from GPU to main memory; 

end 

for z ^ (nPlanes — nStreams + 1) to nPlanes do 
stream = streams[z mod nStreams]; 
eudaStreamSynehronize{stream); 

Save deconvolved plane z to hard-drive; 


end 


4. Libraries and dependencies 

To efficiently calculate the Richardson-Lucy iteration step, convolutions were replaced by multiplications in Fourier domain. Fourier 
transformations were computed using the cuFFT library (https://developer. nvidia.com/cuFFT). Other arithmetic operations were 
implemented as custom CUDA kernel functions. 

The entire workflow was implemented in the C programming language, using the CUDA speciflc extensions. The Fiji plugin was 
implemented in the Java programming language (Oracle Corporation). The C program was interfaced from Java using JNI (Java Native 
Interface). 

The deployed plugin contains for each platform the corresponding binary library, which is statically linked agains the CUDA SDK. 
Additionally, the cuFFT library is bundled, which is required as a shared library. 

Requirements for execution are a Nvidia graphics card that supports CUDA. 
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SUPPLEMENTARY TABLE 1: EXECUTION SPEEDS USING DIFFERENT GRAPHICS CARDS. 


Execution time (s) 

Graphics card 

512^ pixel 

1024® pixel 

2048® pixel 

Quadro K2000 

12.0 

83.7 


683.8 

Tesla C2075 

6.6 

48.2 


378.7 

GeForce GTX 680 

7.8 

29.8 


238.5 

Tesla K40c 

3.9 

19.4 


153.6 

GeForce Titan black 

4.0 

21.1 


152.3 


Table 1. Comparison of execution speed for plane-wise deconvolution using different graphics cards. 10 iterations have been performed deconvolving two 
views. Processing was performed on a Dell T6100 workstation (Intel E5-2630 @2.3 GHz 2 processors, 64 GB RAM). Data sizes correspond to the sizes 
padded for Fourier convolution, i.e. the sum of the actual data size and the size of PSF. 


SUPPLEMENTARY FIGURE 1: DECONVOLUTION RESULTS VIEWED ALONG THE DETECTION AXIS 



Supplementary Figure 1. Deconvolution results viewed along the detection axis, (a) Acquired data of the first view of a 9 hours post fertilization old 
Tg(h2aJva:h2aJva-mCherry) zebrafish embryo. (b,c) Multi-view deconvolution, performed (b) in full 3D and (c) plane-wise. 
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SUPPLEMENTARY FIGURE 2: COMPARISON OF VARIOUS FUSION METHODS USING A SIMULATED DATA 
SET 



Supplementary Figure 2. Comparison of various fusion methods using a simulated data set. (a) Original data resembling a single nucleus of a 
Tg(h2afva:h2afva-mCherry) zebrafish embryo, (b, c) Simulated view 1 and view 2, generated by convolving the original data with an elongated PSF in 
z- and ^-direction. Fusion by (d) averaging, (e) 3D multi-view deconvolution and (f) plane-wise multi-view deconvolution. 
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SUPPLEMENTARY FIGURE 3: COMPARISON OF DECONVOLUTION RESULTS UNDER VARIOUS NOISE 
LEVELS 




- Ground-truth 

- View 1 

-View 2 

- 3D deconvolution 

- Plane-wise deconvolution 


Supplementary Figure 3. Comparison of deconvolution results under various noise levels. Different amounts of Gaussian noise were added to the simulated 
data from Supplementary Fig. 2. For each signal-to-noise ratio (SNR), both views and the deconvolution results from the 3D deconvolution and our plane-wise 
implementation are shown, along the rotation axis (top row) and along the detection axis of view 1 (bottom row). For each SNR value, line profiles are 
shown of the ground truth, the simulated data and the deconvolution results in all three dimensions. Throughout all tested SNR values, results of plane-wise 
deconvolution closely resemble the results of the original 3D implementation. 
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SUPPLEMENTARY FIGURE 4: COMPARISON OF DECONVOLUTION RESULTS USING DIFFERENT PSFS 






- Ground-truth 

- View 1 

-View 2 

- 3D deconvolution 

- Plane-wise deconvolution 


Supplementary Figure 4. Comparison of deconvolution results using different PSFs. Simulated data were created as in Supplementary Fig. 2, using Gaussian 
PSFs with a fixed axial standard deviation cr^ of eight pixels, as determined empirically on our microscope. Different values were used for the lateral standard 
deviation axy. For each value, both views and the deconvolution results from the 3D deconvolution and our plane-wise implementation are shown, along 
the rotation axis (top row) and along the detection axis of view 1 (bottom row). Line profiles are shown of the ground truth, the simulated data and the 
deconvolution results in all three dimensions. While the results obtained by plane-wise and original 3D deconvolution are similar for small values of axy 
below a value of two, they start to diverge for higher values, axy on our microscopes was typically between 1.5 and 1.8 pixels. 
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